Abstract

Most of machine learning deals with vector parameters. Ideally we would like to take 
higher order information into account and make use of matrix or even tensor parameters. 
However the resulting algorithms are usually inefficient. Here we address on-line learning
with matrix parameters. It is often easy to obtain an online algorithm with good generalization
performance if you eigendecompose the current parameter matrix in each trial (at a cost
of $O(n^3)$ per trial). Ideally we want to avoid the decompositions and spend $O(n^2)$ per
trial, i.e. linear time in the size of the matrix data. There is a core trade-off between
the running time and the generalization performance, here measured by the regret of the 
on-line algorithm (total gain of the best off-line predictor minus the total gain of the on-line 
algorithm). 

We focus on the key matrix problem of rank $k$ Principal Component Analysis in $\mathbb{R}^n$
where $k \ll n$. There are $O(n^3)$ algorithms that achieve the optimum regret but require
eigendecompositions. We develop a simple algorithm that needs $O(kn^2)$ per trial whose
regret is off by a small factor of $O(n^{1/4})$. The algorithm is based on the Follow the Perturbed
Leader paradigm. It replaces full eigendecompositions at each trial by the problem of finding
k principal components of the current covariance matrix that is perturbed by Gaussian 
noise. 

1. Introduction 

In Principal Component Analysis (PCA), the data points $x_t \in \mathbb{R}^n$ are projected onto
a $k$-dimensional subspace (represented by a rank $k$ projection matrix $P$). The goal is
to maximize the total squared norm of the projected data points, $\sum_t \|P x_t\|^2$. This is
equivalent to finding the principal eigenvectors $u_1, \ldots, u_k$, i.e. those belonging to the $k$
largest eigenvalues of the data “covariance matrix” $\sum_t x_t x_t^\top$, and setting the projection
matrix to $P = \sum_{i=1}^k u_i u_i^\top$. In this paper we choose the online version of PCA (Warmuth
and Kuzmin, 2008) as our paradigmatic matrix parameter problem and we explore the core
trade-off between generalization performance and time efficiency per trial for this problem.
In each trial $t = 1, \ldots, T$, the online PCA algorithm chooses a projection matrix $P_t$ of rank
$k$ based on the previously observed points $x_1, \ldots, x_{t-1}$. Then a next point $x_t$ is revealed
and the algorithm receives gain $\|P_t x_t\|^2$. The goal here is to obtain an online algorithm
whose cumulative gain over trials $t = 1, \ldots, T$ is close to the cumulative gain of the best
rank $k$ projection matrix chosen in hindsight after seeing all $T$ instances. The maximum
difference between the cumulative gain of the best off-line comparator and the cumulative
gain of the algorithm is called the (worst-case) regret.



Kotlowski and Warmuth 


If you use the principal eigenvectors $u_i$ ($1 \le i \le k$) as the parameters, then the gain
is formidably non-convex. However the key insight of (Warmuth and Kuzmin, 2006, 2008)
is the observation that the seemingly quadratic gain $\|P x_t\|^2$ is a linear function of the
projection matrix $P$ when the data is expressed in terms of the matrix instance $x_t x_t^\top$
rather than vector instance $x_t$:

$$\|P x_t\|^2 = x_t^\top P^2 x_t = x_t^\top P x_t = \mathrm{tr}(P x_t x_t^\top).$$
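The linearity of the gain in the projection matrix can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the example vectors are hypothetical and chosen only for illustration.

```python
import math

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def trace_of_product(A, B):
    """tr(AB) = sum_ij A_ij B_ji, computed without forming AB."""
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))

# Two orthonormal directions in R^3 (hypothetical example data).
u1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
u2 = [0.0, 0.0, 1.0]
P = mat_add(outer(u1, u1), outer(u2, u2))  # rank-2 projection matrix

x = [3.0, 1.0, 2.0]
quadratic_gain = sum(v * v for v in mat_vec(P, x))  # ||P x||^2
linear_gain = trace_of_product(P, outer(x, x))      # tr(P x x^T)
```

Both quantities agree up to rounding, which is what allows the gain to be treated as linear in the parameter matrix.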

Good algorithms hedge their bets by predicting with a random projection matrix. In that
case $E[\|P x_t\|^2]$ becomes $\mathrm{tr}(E[P]\, x_t x_t^\top)$. Thus it is natural to use mixtures $E[P]$ of rank $k$
projection matrices as the parameter matrix of the algorithm. Such mixtures are positive
definite matrices of trace $k$ with eigenvalues capped at 1. The gist is that the gain is now
linear in this alternate parameter matrix and the non-convexity has been circumvented.
This observation is the starting point for lifting known online learning algorithms for linear
gain/loss on vector instances to the matrix domain, which resulted in the Matrix
Exponentiated Gradient (MEG) algorithm (Tsuda et al., 2005; Arora and Kale, 2007; Warmuth
and Kuzmin, 2008), as well as the (matrix) Gradient Descent (GD) algorithm (Arora et al.,
2013, 2012; Jiazhong et al., 2013). Both algorithms are motivated by trading off a Bregman
divergence against the gain, followed by a Bregman projection onto the convex hull
of rank $k$ projection matrices, which is our parameter space. The worst-case regret of these
algorithms for online PCA is optimal (within constant factors). Furthermore MEG remains
optimal for a generalization of the PCA problem to the dense instance case in which the
“sparse” rank one outer products $x_t x_t^\top$ of vanilla PCA are generalized to positive definite
matrices $X_t$ with bounded eigenvalues (Jiazhong et al., 2013).

Unfortunately, both algorithms require full eigendecomposition of the parameter matrix
at a cost of $O(n^3)$ per trial.$^1$ It was posed as an open problem (Hazan et al., 2010) whether
there exists an algorithm with good regret guarantees requiring time comparable with that
of finding the top $k$ eigenvectors. The latter operation can be done efficiently
by means of e.g. power iteration based methods (Arnoldi, 1951; Cullum and Willoughby,
1985): it essentially requires time $O(kn^2)$, which is much less than the cost of a full
eigendecomposition in the natural case when $k \ll n$. This operation is also used by the simple
Follow the Leader algorithm which predicts with the $k$ principal components of the current
covariance matrix. This algorithm performs well when the data is i.i.d. but can be forced
to have large regret on worst-case data.

1. For the rank one instances $x_t x_t^\top$ of PCA the update of the eigenvalues takes $O(n^2)$
per trial (Bunch et al., 1978/79). However the update of the eigensystem remains $O(n^3)$.

In this paper, we provide an algorithm based on the Follow the Perturbed Leader (FPL)
approach, which perturbs the cumulative data matrix by adding a random symmetric noise
matrix, and then predicts with the $k$ principal components of the current perturbed
covariance matrix. The key question is what perturbation to use and whether there exists a
perturbation for which FPL achieves close to optimal regret. In the vanilla vector parameter
based FPL algorithm (Kalai and Vempala, 2005), exponentially distributed perturbations
lead to optimal algorithms when the perturbations are properly scaled. We could apply the
same perturbations to the eigenvalues of the current parameter matrix and achieve optimal
regret. However this approach requires us to eigendecompose the current parameter matrix
and this defeats the purpose. We need to find a perturbation that requires $O(n^2)$ time
to compute instead of $O(n^3)$. We use a random symmetric Gaussian matrix (a so called
Gaussian orthogonal ensemble), which consists of entries generated i.i.d. from a Gaussian
distribution. Our approach is more similar to the recent algorithm based on Random
Walk Perturbation (Devroye et al., 2013) and can be considered as a matrix generalization
thereof, drawing connections to Random Matrix Theory (Tao, 2012). Calculation of our
random noise matrix requires $O(n^2)$ and hence the total computational time is dominated
by finding the $k$ principal components of the perturbed matrix. At the same time, our
algorithm achieves $O(n^{1/4}\sqrt{kT})$ worst-case regret for online PCA (sparse instances) and
$O(k\sqrt{nT})$ worst-case regret for the dense instance case. Comparing to the minimax regrets
$\Theta(\sqrt{kT})$ and $\Theta(k\sqrt{T \log n})$ in the sparse and dense cases, respectively, we are only a factor
of $O(n^{1/4})$ and $O(\sqrt{n/\log n})$ off from the optimum, respectively.

Our approach can be considered a generalization of the Random Walk Perturbation
(RWP) algorithm (Devroye et al., 2013) to the matrix domain. In RWP, an independent
Bernoulli coin flip is added to each component of the loss/gain vector, a process which
can be closely approximated (through the Central Limit Theorem) by Gaussian perturbations
with variance growing linearly in $t$. This is also the case for our algorithm, where we use a
symmetric matrix with i.i.d. Gaussian-distributed entries with variance also growing
linearly in $t$. Our analysis, however, resorts to properties of random matrices, e.g. the expected
maximum eigenvalue, which leads to worse regret bounds than in the vector case.
Interestingly, compared to Devroye et al. (2013) we get rid of an additional $O(\log T)$ factor in the
regret.

Related work. Online PCA, in the framework considered here, was introduced in Tsuda
et al. (2005) and independently in Arora and Kale (2007), along with the Matrix Exponentiated
Gradient algorithm. The problem of finding efficient algorithms which avoid full
eigendecomposition was posed as an open problem by Hazan et al. (2010). An efficient algorithm
for PCA based on Online Gradient Descent was proposed in Arora et al. (2013, 2012), but
the main version of the algorithm (Matrix Stochastic Gradient, MSG) still requires $O(n^3)$
in the worst case, while a faster version (Capped MSG) operates on a low-rank deterministic
parameter matrix, which can be shown to have regret linear in $T$ in the adversarial setting.
The most closely related to our work is Garber et al. (2015), in which several algorithms
are proposed for learning the top eigenvector (i.e., the simplest case $k = 1$ of online PCA):
one based on the online Frank-Wolfe method (Hazan and Kale, 2012), and others based on the FPL approach
with entry-wise uniform perturbation, exponentially-distributed perturbation, and a
perturbation based on a sparse rank-one random matrix $vv^\top$ composed of a random Gaussian
vector $v$. Except for the last algorithm, all the other approaches have regret guarantees
which are inferior compared to our method, either in terms of dependence on $T$ or
dependence on $n$, or both. The method based on rank-one perturbation achieves regret bound
$O(\sqrt{nT})$, which is the same as ours in the dense instance case. It is not clear whether this
method would benefit in any way from sparsity of instance matrices (as in the standard online
PCA), and whether it would easily generalize to the $k > 1$ case.

There are other formulations of the online PCA problem. For instance, Boutsidis et al.
(2015) aim at finding a low dimensional data representation in a single pass, with the goal
of good reconstruction guarantees using only a small number of dimensions. On the other
hand, Balsubramani et al. (2013) and Shamir (2015) consider online PCA in the stochastic
optimization setting. However, the algorithms considered therein are not directly applicable
in the adversarial setting studied in this work.

New conjecture. Our algorithm is efficient, but its suboptimal regret is due to the fact
that the noise matrix does not adapt to the eigensystem of the cumulative covariance matrix.
What $O(n^2)$ perturbation can we use that does adapt to the eigensystem of the current
covariance matrix? A clear candidate is to use Dropout Perturbation. In the vector case
this perturbation independently at random zeros out each component of the gain/loss vector
in each trial (van Erven et al., 2014) and achieves optimal regret without having to tune the
magnitude of the perturbations. In the matrix case it would be natural to independently
zero out components of the instance matrix when expressed in the eigensystem of the current
loss matrix. However these approaches again require eigendecompositions.

A new variant is to skip at trial $t$ the entire instance matrix with probability half, i.e. at
trial $t$ predict with the $k$ principal components of the following perturbed current covariance
matrix

$$\sum_{q=1}^{t-1} \alpha_q X_q,$$

where the $\alpha_q$ are Bernoulli coin flips with probability half. We call this the Follow the
Skipping Leader algorithm because it skips entire instances $X_q$ with probability half. It
is easy to maintain this perturbed covariance matrix in $O(n^2)$ time per trial. However,
unfortunately already in the vector parameter case this algorithm can be forced to have a
gravely suboptimal linear regret in $n$ (Neu and Lugosi, 2014). The counter example requires
dense loss vectors. When lifting this counter example to the matrix setting, the regret
can still be forced to be linear with sparse instances $x_t x_t^\top$. However we conjecture that the
time efficient Follow the Skipping Leader algorithm achieves the optimal regret for standard
PCA with sparse instances. This is because in PCA regret is naturally measured w.r.t. the
maximum gain of the best rank $k$ subspace and not the loss. Note that this type of
problem is decidedly not symmetric w.r.t. gain and loss (see Jiazhong et al. (2013) for an
extended discussion).
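The bookkeeping behind Follow the Skipping Leader can be sketched in a few lines of plain Python. This is only an illustration of the perturbed covariance matrix for sparse instances $x_q x_q^\top$; the function name and the returned list of coin flips are hypothetical, and the prediction step (taking the $k$ principal components of the result) is not shown.

```python
import random

def skipping_leader_matrix(points, rng):
    """Return (S, kept) where S = sum_q alpha_q x_q x_q^T and the alpha_q
    are independent fair coin flips recorded in `kept`."""
    n = len(points[0])
    S = [[0.0] * n for _ in range(n)]
    kept = []
    for x in points:
        alpha = 1 if rng.random() < 0.5 else 0  # skip the instance with probability 1/2
        kept.append(alpha)
        if alpha:
            # Rank-one update, O(n^2) work per trial.
            for i in range(n):
                for j in range(n):
                    S[i][j] += x[i] * x[j]
    return S, kept

points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical data
S, kept = skipping_leader_matrix(points, random.Random(0))
```

Each trial costs one coin flip and at most one rank-one update, which matches the $O(n^2)$ per-trial cost stated above.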

Finally we also conjecture that the regret bound achieved by the algorithm of this paper
(Gaussian perturbations) is the best one can achieve with rotation invariant noise;
knowing this fact would be interesting in its own right.

2. Problem setting 

In online PCA, in each trial $t = 1, \ldots, T$, the algorithm probabilistically chooses a
projection matrix $P_t \in \mathbb{R}^{n \times n}$ of rank $k$. Then a point $x_t \in \mathbb{R}^n$ is revealed and the algorithm
receives gain $\|P_t x_t\|^2 = \mathrm{tr}(P_t x_t x_t^\top)$. Note again that the gain is linear in $P_t$ and in the
instance matrix $x_t x_t^\top$. This observation calls for a generalization of online PCA in which
the instance matrix is any positive definite matrix $X_t$ (with bounded eigenvalues) and the
gain becomes $\mathrm{tr}(P_t X_t)$. We call this the dense instance case as opposed to standard online
PCA, which we call the sparse instance case.

In the above protocol, the algorithm is allowed to choose its $k$ dimensional subspace $P_t$
probabilistically. Therefore we use expected gain $E[\mathrm{tr}(P_t X_t)]$ as the evaluation of the
algorithm’s performance, where the expectation is with respect to the internal randomization
of the algorithm. The regret of the algorithm is then the difference between the cumulative
gain of the best off-line rank $k$ projector and the cumulative gain of the algorithm
(due to linearity of gain, no randomization is necessary when considering the best off-line
comparator):

$$\mathcal{R} = \max_{P \in \mathcal{P}} \Big\{ \sum_{t=1}^T \mathrm{tr}(P X_t) \Big\} - \sum_{t=1}^T E[\mathrm{tr}(P_t X_t)] = \lambda_{1:k}(X_{\le T}) - \sum_{t=1}^T E[\mathrm{tr}(P_t X_t)], \qquad (1)$$

where $\mathcal{P}$ denotes the set of all rank $k$ projectors, $X_{\le t} = \sum_{q=1}^{t} X_q$ is the cumulative data
matrix, and $\lambda_{1:k}(X) = \sum_{i=1}^{k} \lambda_i(X)$ denotes the sum of the top $k$ eigenvalues of $X$. The goal
of the algorithm is to have small regret for any sequence of instance matrices. Since the
regret naturally scales with the eigenvalues of $X_t$, we assume for the sake of simplicity that
all eigenvalues of $X_t$ are bounded by 1 (i.e. for each $t = 1, \ldots, T$, the spectral norm of $X_t$
satisfies $\|X_t\|_\infty \le 1$).

We note that due to linearity of the gain, $E[\mathrm{tr}(P_t X_t)] = \mathrm{tr}(E[P_t] X_t)$, so that the
algorithm’s gain is fully determined by $E[P_t]$, a convex combination of rank $k$ projection
matrices. Hence, the parameter set of the algorithm can be equivalently taken as $\mathcal{W} = \mathrm{conv}(\mathcal{P})$,
the convex hull of $\mathcal{P}$, which is the set of positive definite matrices with trace $k$ and all eigenvalues
not larger than 1 (Warmuth and Kuzmin, 2008). This is the key idea behind the MEG and GD
algorithms, which maintain the uncertainty about the projection matrix by means of a
parameter $W_t \in \mathcal{W}$, and update their parameter by minimizing a trade-off between a divergence of
the new and old parameter and the gain/loss of the new parameter on the current instance,
while constraining the new parameter to lie in the parameter set $\mathcal{W}$. While predicting, the
algorithm chooses its projection matrix $P_t$ by sampling from this mixture $W_t$ (Warmuth
and Kuzmin, 2008).

3. The algorithm 

Our algorithm belongs to a class of Follow the Perturbed Leader (FPL) algorithms, which
are defined by the choice:

$$P_t = \operatorname*{argmax}_{P \in \mathcal{P}} \big\{ \mathrm{tr}(P(X_{<t} + N_t)) \big\},$$

where $X_{<t} = \sum_{q<t} X_q$ is the cumulative data matrix observed so far, while $N_t$ is the
symmetric noise matrix generated randomly by the algorithm$^2$ and w.l.o.g. we assume
$E[N_t] = 0$. Perturbing the cumulative data matrix is necessary as one can easily show that
any deterministic strategy (including Follow the Leader obtained by taking $N_t = 0$) can be
forced to have regret linear in $T$.

2. Note that if the algorithm plays against an oblivious adversary, it is allowed to generate the noise matrix
once in the first trial and then reuse it throughout the game.

Define a “fake” prediction strategy:

$$\tilde{P}_t = \operatorname*{argmax}_{P \in \mathcal{P}} \big\{ \mathrm{tr}(P(X_{\le t} + N_t)) \big\},$$

which acts as FPL, but adds the current instance $X_t$ to the cumulative data matrix, and
hence does not correspond to any valid online algorithm. What follows is a standard lemma
for bounding the FPL regret, adapted to the matrix case:


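Lemma 1 We have:

$$\mathcal{R} \le \sum_{t=1}^T E\big[\mathrm{tr}((\tilde{P}_t - P_t) X_t)\big] + \sum_{t=1}^T E\big[\lambda_{1:k}(N_t - N_{t-1})\big],$$

where $\mathcal{R}$ is the regret (1), $\tilde{P}_t$ is the fake prediction strategy and $N_0 = 0$.

A vector version of Lemma 1 can be found in standard textbooks on online learning (see, e.g.,
Cesa-Bianchi and Lugosi (2006)). Since adaptation to the matrix case is rather
straightforward, we defer the proof to the Appendix.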

We now specify the noise matrix our algorithm employs. Let $G$ be an $n \times n$ matrix such
that each entry is generated i.i.d. from a Gaussian distribution, i.e. $G_{ij} \sim \mathcal{N}(0, \sigma^2)$. We
define the noise matrix of our algorithm as:

$$N_t = \sqrt{t}\, N, \quad \text{where } N = \tfrac{1}{2}\big(G + G^\top\big).$$

Note that $N$ is a symmetrized version of $G$, and $N_t$ multiplies $N$ by $\sqrt{t}$ so that the variance
of each entry in $N_t$ grows linearly in $t$. Interestingly, the distribution of $N$ is known as the Gaussian
orthogonal ensemble in Random Matrix Theory (Tao, 2012). The algorithm uses a
variable noise rate and hence does not require any tuning for the time horizon $T$. We still
have a single parameter $\sigma^2$, but that parameter is only chosen based on the sparseness of
the instance matrix.

Note that according to the rules for summing Gaussian variables, we can also express $N_t$ as a
sum of $t$ independent copies of $N$, $N_t = \sum_{q=1}^{t} N^{(q)}$. We thus get an equivalent picture of our
algorithm in which in each trial, an independent noise variable $N^{(t)}$ generated from a fixed
distribution is added to the current data instance $X_t$, and then the action of the algorithm
is based on the sum of perturbed data instances. This picture makes our approach similar
to the RWP algorithm and lets us relate our algorithm to dropout perturbation in the next
section.
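The prediction rule with Gaussian perturbation can be sketched as follows. This is a minimal plain-Python illustration under several assumptions that are not taken from the paper: all function names are hypothetical, the top-$k$ eigenspace is found by a shifted orthogonal (block power) iteration standing in for the power iteration based methods mentioned in the introduction, fresh noise is drawn in every trial, and instances are sparse ($X_t = x_t x_t^\top$).

```python
import math
import random

def goe_noise(n, sigma2, rng):
    """Symmetric noise N = (G + G^T) / 2 with G_ij ~ N(0, sigma2) i.i.d."""
    s = math.sqrt(sigma2)
    G = [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (G[i][j] + G[j][i]) for j in range(n)] for i in range(n)]

def orthonormalize(vectors):
    """Modified Gram-Schmidt on a list of vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = sum(x * y for x, y in zip(w, b))
            w = [x - c * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis

def top_k_projection(A, k, iters=500, rng=None):
    """Projector onto the top-k eigenspace of symmetric A (orthogonal iteration)."""
    rng = rng or random.Random(0)
    n = len(A)
    # Shift the spectrum so every eigenvalue is positive; eigenvectors and their
    # order are unchanged, and power steps then converge to the top ones.
    shift = math.sqrt(sum(x * x for row in A for x in row)) + 1.0
    B = [[A[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    Q = orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)])
    for _ in range(iters):
        Q = orthonormalize([[sum(B[i][j] * q[j] for j in range(n)) for i in range(n)]
                            for q in Q])
    return [[sum(q[i] * q[j] for q in Q) for j in range(n)] for i in range(n)]

def fpl_predict(X_cum, t, k, sigma2, rng):
    """P_t = argmax over rank-k projectors of tr(P (X_{<t} + sqrt(t) N))."""
    n = len(X_cum)
    N = goe_noise(n, sigma2, rng)  # fresh noise each trial (an assumption)
    root_t = math.sqrt(t)
    perturbed = [[X_cum[i][j] + root_t * N[i][j] for j in range(n)] for i in range(n)]
    return top_k_projection(perturbed, k, rng=rng)

def run_online_pca(points, k, sigma2, seed=0):
    """Play the online PCA protocol and return the algorithm's total gain."""
    rng = random.Random(seed)
    n = len(points[0])
    X_cum = [[0.0] * n for _ in range(n)]
    total_gain = 0.0
    for t, x in enumerate(points, start=1):
        P = fpl_predict(X_cum, t, k, sigma2, rng)
        Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
        total_gain += sum(v * v for v in Px)  # gain ||P_t x_t||^2
        for i in range(n):                    # add x_t x_t^T to the cumulative matrix
            for j in range(n):
                X_cum[i][j] += x[i] * x[j]
    return total_gain
```

For example, `run_online_pca(points, k=1, sigma2=0.1)` on a list of unit vectors returns a total gain between 0 and the number of trials. The fixed iteration count is a simplification; a production implementation would more likely use a Lanczos-type solver with a convergence test.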

We now show the main result of this paper, the regret bound of the algorithm based on 
Gaussian perturbation. 


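Theorem 2 Given the choice of the noise matrix $N_t$ described above, the regret $\mathcal{R}$ satisfies:

• For dense instances, setting $\sigma^2 = 1$ gives:

$$\mathcal{R} \le 2k\sqrt{nT}.$$

• For sparse instances, setting $\sigma^2 = \frac{1}{k\sqrt{n}}$ gives:

$$\mathcal{R} \le 2n^{1/4}\sqrt{kT}.$$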


Proof We apply Lemma 1 and bound both sums on the right hand side separately. We
start with the second sum. We have:

$$\sum_{t=1}^T E\big[\lambda_{1:k}(N_t - N_{t-1})\big] = \sum_{t=1}^T \big(\sqrt{t} - \sqrt{t-1}\big) E\big[\lambda_{1:k}(N)\big] = \sqrt{T}\, E\big[\lambda_{1:k}(N)\big].$$

It follows from Random Matrix Theory (see, e.g., Davidson and Szarek (2001)) that the
largest eigenvalue of a matrix generated from a Gaussian orthogonal ensemble is of order
$O(\sqrt{n})$, specifically:

$$E\big[\lambda_{\max}(N)\big] \le \sqrt{n\sigma^2}.$$

Therefore,

$$E\big[\lambda_{1:k}(N)\big] \le k\, E\big[\lambda_{\max}(N)\big] \le k\sqrt{n\sigma^2},$$

so that the second sum is bounded by $k\sqrt{nT\sigma^2}$.

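Let us now bound the first sum. First, note that $N_{ij} \sim \mathcal{N}(0, \sigma^2/2)$ for $i \ne j$ and
$N_{ii} \sim \mathcal{N}(0, \sigma^2)$. This means that the joint density $p(N) = p(N_{11}, \ldots, N_{nn})$ is proportional
to:

$$p(N) \propto \exp\Big\{ -\frac{1}{2\sigma^2} \Big( \sum_{i} N_{ii}^2 + 2 \sum_{i>j} N_{ij}^2 \Big) \Big\} = \exp\Big\{ -\frac{1}{2\sigma^2}\, \mathrm{tr}(N^2) \Big\}.$$

Similarly, the joint density of $N_t$ is proportional to:

$$p_t(N_t) \propto \exp\Big\{ -\frac{1}{2t\sigma^2}\, \mathrm{tr}(N_t^2) \Big\}.$$

For any symmetric matrix $A$, define:

$$P(A) = \operatorname*{argmax}_{P \in \mathcal{P}} \big\{ \mathrm{tr}(PA) \big\},$$

so that $P_t = P(X_{<t} + N_t)$ and the fake strategy is $\tilde{P}_t = P(X_{<t} + X_t + N_t)$. Furthermore, define a function:

$$f_t(s) = E\big[\mathrm{tr}(P(X_{<t} + N_t + sX_t)\, X_t)\big].$$

Note that $f_t(0) = E[\mathrm{tr}(P_t X_t)]$ and $f_t(1) = E[\mathrm{tr}(\tilde{P}_t X_t)]$. In this notation,

$$\sum_{t} E\big[\mathrm{tr}((\tilde{P}_t - P_t) X_t)\big] = \sum_{t} \big( f_t(1) - f_t(0) \big),$$

so it remains to bound $f_t(1) - f_t(0)$ for all $t$. We have:

$$f_t(s) = \int \mathrm{tr}(P(X_{<t} + N_t + sX_t)\, X_t)\, p_t(N_t)\, dN_t = \int \mathrm{tr}(P(X_{<t} + N_t)\, X_t)\, p_t(N_t - sX_t)\, dN_t,$$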

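which follows from changing the integration variable from $N_t$ to $N_t - sX_t$. Since by Hölder’s
inequality:

$$\mathrm{tr}(P(X_{<t} + N_t)\, X_t) \le \mathrm{tr}(P(X_{<t} + N_t)) \cdot \|X_t\|_\infty = k \|X_t\|_\infty \le k, \qquad (2)$$

and since $p_t$ is the density of a Gaussian distribution, it can easily be shown by using a standard
argument based on the Dominated Convergence Theorem that one can exchange the order of
differentiation w.r.t. $s$ and integration w.r.t. $N_t$. This means that $f_t(s)$ is differentiable
and:

$$f_t'(s) = \int \mathrm{tr}(P(X_{<t} + N_t)\, X_t)\, \frac{d\, p_t(N_t - sX_t)}{ds}\, dN_t$$
$$= \frac{1}{t\sigma^2} \int \mathrm{tr}(P(X_{<t} + N_t)\, X_t)\, \mathrm{tr}((N_t - sX_t) X_t)\, p_t(N_t - sX_t)\, dN_t$$
$$= \frac{1}{t\sigma^2} \int \mathrm{tr}(P(X_{<t} + N_t + sX_t)\, X_t)\, \mathrm{tr}(N_t X_t)\, p_t(N_t)\, dN_t$$
$$\le \frac{r}{t\sigma^2} \int \big(\mathrm{tr}(N_t X_t)\big)_+\, p_t(N_t)\, dN_t,$$

where $(c)_+ = \max\{c, 0\}$, and $r = k$ in the dense instance case, while $r = 1$ in the sparse
instance case. The last inequality uses that $\mathrm{tr}(P X_t)$ lies between 0 and $r$ for every $P \in \mathcal{P}$:
it is nonnegative because $P$ and $X_t$ are both positive semidefinite, and the upper bound follows from the
same argument as in (2) when the instances are dense, and from the opposite application of Hölder’s
inequality when the instances are sparse, i.e. for sparse instance $X_t = x_t x_t^\top$ with $\|x_t\| = 1$, and any $P$:

$$\mathrm{tr}(P x_t x_t^\top) \le \mathrm{tr}(x_t x_t^\top) \cdot \|P\|_\infty = 1 \cdot 1 = 1.$$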


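Denote:

$$z = \mathrm{tr}(N_t X_t) = 2 \sum_{i>j} (N_t)_{ij} (X_t)_{ij} + \sum_{i} (N_t)_{ii} (X_t)_{ii}.$$

Using summation rules for Gaussian variables:

$$z \sim \mathcal{N}\Big( 0,\ 2t\sigma^2 \sum_{i>j} (X_t)_{ij}^2 + t\sigma^2 \sum_{i} (X_t)_{ii}^2 \Big) = \mathcal{N}\big( 0,\ t\sigma^2\, \mathrm{tr}(X_t^2) \big),$$

so that:

$$f_t'(s) \le \frac{r}{t\sigma^2} \int \big(\mathrm{tr}(N_t X_t)\big)_+\, p_t(N_t)\, dN_t = \frac{r}{t\sigma^2}\, E_{z \sim \mathcal{N}(0,\, t\sigma^2 \mathrm{tr}(X_t^2))}\big[(z)_+\big] = \frac{r}{\sqrt{2\pi t \sigma^2}} \sqrt{\mathrm{tr}(X_t^2)}.$$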

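By the mean value theorem,

$$f_t(1) - f_t(0) = f_t'(s), \quad \text{for some } s \in [0, 1],$$

which implies:

$$E\big[\mathrm{tr}((\tilde{P}_t - P_t) X_t)\big] \le \frac{r}{\sqrt{2\pi t \sigma^2}} \sqrt{\mathrm{tr}(X_t^2)}.$$

Summing over trials and using $\sum_{t=1}^T 1/\sqrt{t} \le 2\sqrt{T}$, we bound the first sum in Lemma 1 by
$r \sqrt{2T \max_t \mathrm{tr}(X_t^2) / (\pi\sigma^2)}$. Combining this with the bound on the second sum, we get that the regret satisfies:

$$\mathcal{R} \le r \sqrt{\frac{2T \max_t \mathrm{tr}(X_t^2)}{\pi\sigma^2}} + k\sqrt{nT\sigma^2}.$$

The proof is finished by noticing that $\mathrm{tr}(X_t^2) \le n$ and $r = k$ for dense instances, while
$\mathrm{tr}(X_t^2) = 1$ and $r = 1$ for sparse instances. ■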
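The two Gaussian facts used in the proof, namely that $\mathrm{tr}(N_t X_t)$ has variance $t\sigma^2\, \mathrm{tr}(X_t^2)$ and that $E[(z)_+] = s/\sqrt{2\pi}$ for $z \sim \mathcal{N}(0, s^2)$, can be checked by simulation. The sketch below is a Monte Carlo illustration only; the instance matrix `X`, the sample size and the helper names are hypothetical choices.

```python
import math
import random

def goe_noise(n, sigma2, rng):
    """Symmetric noise N = (G + G^T) / 2 with G_ij ~ N(0, sigma2) i.i.d."""
    s = math.sqrt(sigma2)
    G = [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (G[i][j] + G[j][i]) for j in range(n)] for i in range(n)]

def trace_of_product(A, B):
    """tr(AB) = sum_ij A_ij B_ji."""
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))

rng = random.Random(0)
n, t, sigma2 = 3, 4, 0.5
# A symmetric positive definite instance with eigenvalues below 1 (hypothetical).
X = [[0.6, 0.2, 0.0], [0.2, 0.5, 0.1], [0.0, 0.1, 0.3]]

samples = []
for _ in range(20000):
    N = goe_noise(n, sigma2, rng)
    N_t = [[math.sqrt(t) * v for v in row] for row in N]  # N_t = sqrt(t) N
    samples.append(trace_of_product(N_t, X))

var_predicted = t * sigma2 * trace_of_product(X, X)          # t sigma^2 tr(X^2)
var_empirical = sum(z * z for z in samples) / len(samples)   # the mean of z is zero
pos_predicted = math.sqrt(var_predicted / (2 * math.pi))     # E[(z)_+]
pos_empirical = sum(max(z, 0.0) for z in samples) / len(samples)
```

With this many samples the empirical values should agree with the predicted ones to within a few percent.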


Comparing the regret with values of the minimax regret $\Theta(\sqrt{kT})$ in the standard online
PCA setting (sparse instance case), and $\Theta(k\sqrt{T \log n})$ in the dense instance case (Jiazhong
et al., 2013), we see that the algorithm presented here is suboptimal by a factor of $O(n^{1/4})$
in the online PCA setting, and by a factor of $O(\sqrt{n/\log n})$ in the dense instances setting.
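As a worked check of the constants, the combined bound from the proof of Theorem 2 can be evaluated for both settings. The sparse-case value $\sigma^2 = 1/(k\sqrt{n})$ used below is an assumption: it is the choice that balances the two terms and reproduces the stated bound.

```latex
% Combined bound from the proof of Theorem 2:
%   R <= r * sqrt(2 T max_t tr(X_t^2) / (pi sigma^2)) + k * sqrt(n T sigma^2)
\begin{align*}
\text{dense } (r = k,\ \mathrm{tr}(X_t^2) \le n,\ \sigma^2 = 1):\quad
\mathcal{R} &\le k\sqrt{\tfrac{2nT}{\pi}} + k\sqrt{nT}
  = \Big(\sqrt{\tfrac{2}{\pi}} + 1\Big) k\sqrt{nT}
  \approx 1.80\, k\sqrt{nT} \le 2k\sqrt{nT}, \\
\text{sparse } (r = 1,\ \mathrm{tr}(X_t^2) = 1,\ \sigma^2 = \tfrac{1}{k\sqrt{n}}):\quad
\mathcal{R} &\le \sqrt{\tfrac{2Tk\sqrt{n}}{\pi}} + k\sqrt{\tfrac{nT}{k\sqrt{n}}}
  = \Big(\sqrt{\tfrac{2}{\pi}} + 1\Big) n^{1/4}\sqrt{kT}
  \le 2n^{1/4}\sqrt{kT}, \\
\text{ratios to the minimax rates:}\quad
\frac{n^{1/4}\sqrt{kT}}{\sqrt{kT}} &= n^{1/4}, \qquad
\frac{k\sqrt{nT}}{k\sqrt{T\log n}} = \sqrt{\tfrac{n}{\log n}}.
\end{align*}
```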

4. Conclusions 

In this paper, we studied the online PCA problem and its generalization to the case of
dense instance matrices. While there are algorithms which essentially achieve the minimax
regret, such as Matrix Exponentiated Gradient or (Matrix) Gradient Descent, all these
methods take $O(n^3)$ per trial, because they require full eigendecomposition of the parameter
matrix. We proposed an algorithm based on the Follow the Perturbed Leader approach, which
uses as a perturbation a random symmetric Gaussian matrix. The algorithm avoids full
eigendecomposition and only requires calculating the top $k$ eigenvectors. Hence, prediction
takes $O(kn^2)$, while the algorithm achieves a worst-case regret which is only $O(n^{1/4})$ close
to the minimax regret for the standard online PCA setting, and $O(\sqrt{n})$ close to the minimax regret
for the generalization of online PCA to dense instance matrices. Finally, we raised an open
question, whether a more adaptive version of our algorithm, based on dropout perturbation,
would achieve the minimax regret.
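Appendix A. Proof of Lemma 1

Let $\tilde{P}_t$ denote the fake prediction strategy, i.e. the maximizer of $\mathrm{tr}(P(X_{\le t} + N_t))$ over $P \in \mathcal{P}$. We have:

$$\lambda_{1:k}(X_{\le t} + N_t) = \max_{P \in \mathcal{P}} \big\{ \mathrm{tr}(P(X_{\le t} + N_t)) \big\} = \mathrm{tr}(\tilde{P}_t (X_{\le t} + N_t))$$
$$= \mathrm{tr}(\tilde{P}_t (X_t + N_t - N_{t-1})) + \mathrm{tr}(\tilde{P}_t (X_{<t} + N_{t-1}))$$
$$\le \mathrm{tr}(\tilde{P}_t (X_t + N_t - N_{t-1})) + \lambda_{1:k}(X_{<t} + N_{t-1}),$$

so that:

$$\lambda_{1:k}(X_{\le t} + N_t) - \lambda_{1:k}(X_{<t} + N_{t-1}) \le \mathrm{tr}(\tilde{P}_t (X_t + N_t - N_{t-1})).$$

Summing over trials $t = 1, \ldots, T$, the terms on the left-hand side telescope and, defining
$N_0 = 0$, we get:

$$\lambda_{1:k}(X_{\le T} + N_T) \le \sum_{t=1}^T \mathrm{tr}(\tilde{P}_t (X_t + N_t - N_{t-1})).$$

Since $\lambda_{1:k}(\cdot)$ is convex as a maximum over linear functions and $E[N_T] = 0$, Jensen’s inequality implies
$\lambda_{1:k}(X_{\le T}) \le E[\lambda_{1:k}(X_{\le T} + N_T)]$ and hence:

$$\lambda_{1:k}(X_{\le T}) \le \sum_{t=1}^T E\big[\mathrm{tr}(\tilde{P}_t (X_t + N_t - N_{t-1}))\big]$$
$$= \sum_{t=1}^T E\big[\mathrm{tr}(\tilde{P}_t X_t)\big] + \sum_{t=1}^T E\big[\mathrm{tr}(\tilde{P}_t (N_t - N_{t-1}))\big]$$
$$\le \sum_{t=1}^T E\big[\mathrm{tr}(\tilde{P}_t X_t)\big] + \sum_{t=1}^T E\Big[\max_{P \in \mathcal{P}} \mathrm{tr}(P(N_t - N_{t-1}))\Big]$$
$$= \sum_{t=1}^T E\big[\mathrm{tr}(\tilde{P}_t X_t)\big] + \sum_{t=1}^T E\big[\lambda_{1:k}(N_t - N_{t-1})\big].$$

The lemma follows by plugging the inequality above into the definition of the regret (1).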


References 

Walter E. Arnoldi. The principle of minimized iterations in the solution of the matrix
eigenvalue problem. Quarterly of Applied Mathematics, 9:17-29, 1951. 

Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization 
for PCA and PLS. In 2012 50th Annual Allerton Conference on Communication, Control, 
and Computing, pages 861-868, 2012. 

Raman Arora, Andrew Cotter, and Nati Srebro. Stochastic optimization of PCA with 
capped MSG. In NIPS, pages 1815-1823, 2013. 

Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite 
programs. In STOC, pages 227-236. ACM, 2007. 

Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of
incremental PCA. In NIPS, pages 3174-3182, 2013.

Christos Boutsidis, Dan Garber, Zohar Shay Karnin, and Edo Liberty. Online Principal 
Components Analysis. In SODA, pages 887-901, 2015. 

James R. Bunch, Christopher P. Nielsen, and Danny C. Sorensen. Rank-one modification of
the symmetric eigenproblem. Numerische Mathematik, 31(1):31-48, 1978/79.

Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006. ISBN 978-0-521-84108-5.

Jane K. Cullum and Ralph A. Willoughby. Lanczos Algorithms for Large Symmetric
Eigenvalue Computations. Cambridge University Press, 1985.

Kenneth R. Davidson and Stanislaw J. Szarek. Local operator theory, random matrices and 
Banach spaces. In Handbook of the geometry of Banach spaces. Volume 1, pages 317-366. 
Elsevier, North-Holland, Amsterdam, 2001. 

Luc Devroye, Gabor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. 
In COLT, pages 460-473, 2013. 



PCA WITH Gaussian perturbations 


Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In ICML, 2015.

Elad Hazan and Satyen Kale. Projection-free online learning. In ICML, 2012.

Elad Hazan, Satyen Kale, and Manfred K. Warmuth. On-line variance minimization in
$O(n^2)$ per trial? In COLT, pages 314-315. Omnipress, 2010. Open problem.

Nie Jiazhong, Wojciech Kotlowski, and Manfred K. Warmuth. Online PCA with optimal 
regrets. In ALT, volume 8139 of LNCS, pages 98-112. Springer, 2013. 

Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision
problems. J. Comput. Syst. Sci., 71(3):291-307, 2005.

Gergely Neu and Gabor Lugosi. Private communication, 2014. 

Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. 
In ICML, 2015. 

Terence Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012. 

Koji Tsuda, Gunnar Ratsch, and Manfred K. Warmuth. Matrix exponentiated gradient 
updates for on-line learning and Bregman projections. Journal of Machine Learning 
Research, 6:995-1018, 2005. 

Tim van Erven, Wojciech Kotlowski, and Manfred K. Warmuth. Follow the leader with 
dropout perturbations. In COLT, pages 949-974, 2014. 

Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. In COLT, pages 
514-528, 2006. 

Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret 
bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9: 
2287-2320, 2008. 

